An excavate root for the eukaryote tree of life 
Much of the higher-order phylogeny of eukaryotes is well resolved, but the root remains elusive. We assembled 
a dataset of 183 eukaryotic proteins of archaeal ancestry to test this root. The resulting phylogeny identifies four 
lineages of eukaryotes currently classified as “Excavata” branching separately at the base of the tree. Thus, Para- 
basalia appear as the first major branch of eukaryotes followed sequentially by Fornicata, Preaxostyla, and 
Discoba. All four excavate branch points receive full statistical support from analyses with commonly used evo- 
lutionary models, a protein structure partition model that we introduce here, and various controls for deep phy- 
logeny artifacts. The absence of aerobic mitochondria in Parabasalia, Fornicata, and Preaxostyla suggests that 
modern eukaryotes arose under anoxic conditions, probably much earlier than expected, and without the 
benefit of mitochondrial respiration. 


INTRODUCTION 
The major outlines of the eukaryote tree of life are beginning to co- 
alesce, largely through a combination of multilocus phylogeny 
(phylogenomics) and the discovery of unique taxa filling in long- 
standing gaps in the tree (1). However, the root of the tree 
remains uncertain. One taxon set of particular interest is the so- 
called Excavata, a diverse collection of almost exclusively single- 
celled eukaryotes, many of which ingest their food via a deep (exca- 
vated) feeding groove (2). Taxa assigned to this group include 
Discoba, Fornicata, Preaxostyla, and Parabasalia (3). Discoba are 
mostly aerobic unicells, but with exceptionally diverse mitochon- 
dria and mitochondrial DNA (4). The remaining excavates are re- 
stricted to anaerobic or low-oxygen environments and have only 
what appear to be degenerate mitochondria-derived organelles [mi- 
tosomes or hydrogenosomes; (5)], if any at all (6). These “anaerobic 
excavates” or Metamonada (1, 3) are strictly unicellular with 
diverse, often notable, morphologies and are likely to be major 
players in anoxic environments such as marine sediments, which 
are among the largest and least explored planetary ecosystems (7, 
8). The best studied anaerobic excavates are parasites, e.g., 
Giardia and Trichomonas, but the true diversity and distribution 
of the various excavate lineages are poorly understood (9). The phy- 
logenetic position of these taxa is also poorly defined, and their pre- 
sumed monophyly has not been tested in a rooted multigene tree. 
Phylogenetic rooting most often relies on an outgroup. Because 
over half of the universal eukaryotic genes are derived from Bacteria 
or Archaea (10), the eukaryote tree can potentially be rooted with 
either archaeal or bacterial homologs. However, these genes are 
quite different. Universal eukaryotic genes of bacterial ancestry 
(euBacs) tend to be associated with mitochondria-related functions, 
while genes of archaeal ancestry (euArcs) tend to be involved in in- 
formation processing such as replication, transcription, translation 
and protein modification, sorting, and degradation (10). In previous 
work, we showed that euBac phylogeny places Discoba as the sister 
group to all actively mitochondriate eukaryotes (11). However, 
these analyses excluded the anaerobic excavates, because they lack 
mitochondrial DNA and most mitochondrially targeted euBacs. By 
contrast, all eukaryotes have most euArcs, and therefore, we turned 
to euArc sequences to test the position of the anaerobic excavates in 
the eukaryote tree of life. 

Evolutionary models are an integral part of molecular phyloge- 
ny. These models consist of parameters for substitution processes 
that are either estimated from the data or empirically estimated 
from highly curated external data [e.g., (12)]. One of the most influ- 
ential recent advances in phylogenetic modeling is category profile 
mixture (CAT) models, which calculate the site likelihood for each 
alignment column as the weighted mean over all observed substitu- 
tion patterns in the alignment. This has the attractive feature of 
showing a good fit to the data but at great expense in terms of 
demand on computation time and memory as well as involving a 
certain degree of circularity. Moreover, it has long been known 
that, at least for protein sequences, amino acid substitution patterns 
are primarily constrained by relatively simple patterns of protein 
secondary structure and solvent accessibility (12-14). These 
factors can be accounted for in phylogenetic analysis either with 
protein mixture models, which try to capture the structure based 
on site-wise amino acid composition, or with partition models, 
where site likelihood is calculated on the basis of a previously known 
structure. However, the latter is rarely used in practice because 
the relevant structural information is often unavailable or challeng- 
ing to work with. 
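
As a hedged illustration of the difference between the two approaches, the site likelihoods can be written as follows. The notation is assumed for illustration and is not taken from the source: D_i is alignment column i, T is the tree with its branch lengths, K is the number of mixture components with weights w_k, \theta_k is the set of substitution parameters of component k, and c(i) is the structure category assigned to site i.

```latex
% Mixture model: every site is averaged over all K components.
L_i^{\mathrm{mix}} = \sum_{k=1}^{K} w_k \, P(D_i \mid T, \theta_k),
\qquad \sum_{k=1}^{K} w_k = 1

% Partition model: each site uses only the parameters of its assigned category.
L_i^{\mathrm{part}} = P(D_i \mid T, \theta_{c(i)})
```

Under these assumptions, the mixture form evaluates K components per site, whereas the partition form evaluates one, which is the intuition behind the difference in computational demand.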

We have assembled a dataset of 183 euArc proteins and used 
these to explore the position of the eukaryote root. The data 
include broad and deep taxonomic sampling of eukaryotes and 
Archaea, including new and/or newly assembled public data for 
31 excavate taxa. To facilitate analyses of the data, we developed a 
method to use predicted protein secondary structure and solvent 
accessibility to partition the data into the six main structure- 
solvent categories and then analyzed the data with substitution ma- 
trices specific for each site category. The model is simple and fast 
and provides a good fit to the data. Our analyses of a concatenation 
of the 183 euArc proteins using a variety of phylogenetic models, 
including the protein structure partition model, and various con- 
trols for deep phylogeny artifacts strongly and consistently 
support an excavate root for the eukaryote tree of life. 

RESULTS 

Eukaryotes can be roughly divided into six major divisions or su- 
pergroups. These are Amorphea (including animals and fungi), Di- 
aphoretickes (including plants, most algae, and numerous and 
diverse eukaryotic microbes, saprobes, and parasites), one major 
lineage of aerobic excavates (Discoba), and three lineages of anaer- 
obic excavates (Parabasalia, Preaxostyla, and Fornicata) (table S1) 
(1, 3). Sequences for a broad taxon sampling of the six supergroups 
plus most major divisions of Archaea were extracted from public 
data for 183 euArc protein orthologs, identified by iterative screen- 
ing of an intersection between the archaeal and eukaryotic clusters 
of orthologous genes (arCOG and KOG, respectively) databases 
(15). Because publicly available assembled data for excavates are 
sparse, we further augmented the excavate data with four in- 
house transcriptomes and 27 in-house assembled public se- 
quence-read-archive (SRA) files (table S1). 

To facilitate the analysis of this large and complex dataset and to 
address a very deep evolutionary question, we developed a simpli- 
fied procedure to use protein secondary structure to model the un- 
derlying evolutionary process. This procedure benefits from a 
recently developed deep learning method that predicts protein sec- 
ondary structure and solvent accessibility with high speed and ac- 
curacy (16). In the procedure used here, individual protein 
structures are predicted based on full-length sequences for 10 taxa 
from across the dataset, with the resulting predicted structures 
mapped back to trimmed alignments. This allows the data to be par- 
titioned, according to a majority consensus of the predicted struc- 
tures, into six site categories corresponding to buried and exposed 
helices, sheets, and loops (data S1). The resulting partitioned matrix 
is then analyzed in a maximum likelihood framework using prede- 
termined substitution matrices for each site category (12). 

Phylogenetic analysis of a concatenation of the 183 euArc ortho- 
logs for 186 taxa (tables S1 to S3) produces a single well-resolved 
phylogeny of eukaryotes with 99 to 100% bootstrap support for 
all eukaryote supergroups [indicated by labels in Fig. 1, table S1, 
and (3)]. The rooted phylogeny places the four excavate lineages 
separately as the first four major branches in the eukaryote 
subtree, with 100% bootstrap support for each excavate branch 
point (Fig. 1 and figs. S1 to S3). The first of these branches is Para- 
basalia, which is then followed sequentially by Fornicata, Preaxos- 
tyla, and, lastly, Discoba as sister group to a clade of Amorphea + 
Diaphoretickes. This topology is fully reconstructed with a variety 
of evolutionary models, all with four gamma rate categories: the C20 
and C60 profile mixture models (17), protein structure mixture 
models [EX2, EHO, and EX_EHO + G; (12)], and the protein struc- 
ture-based partition model introduced here (Fig. 1, figs. S1 to S3, 
and data S1 and S2). In all cases, the trees show full support 
(100% bootstrap) for the four excavate branch points and all eukary- 
ote supergroups with the exception of the internal branching order 
within Diaphoretickes subgroup SAR (Stramenopila + Alveolata + 
Rhizaria; Fig. 1). 

A comparison of fit among the various evolutionary models 
shows that all models produce a comparable fit to these data, with 
structure-based models providing nearly as good a fit as the C20 and 
C60 profile mixture models. However, the structure models do this 
with a fraction of the complexity (number of categories) and run 
time, with the structure partition model introduced here being 
the simplest and fastest (Fig. 2). The relative simplicity of the 
partitioned structure models should also reduce the risk of overfit- 
ting (18), as well as having the obvious advantage of reducing the 
demand on analytical time and resources (Fig. 2). For example, full 
analyses of the 183-protein euArc data with the structure partition 
model required 46 Central Processing Unit (CPU) hours and 5 GB 
of memory. This is compared to more than 1000 hours and 106 GB 
of memory for analyses with the C20 mixture model and over 3200 
hours and 338 GB for the C60 model on a 20-core supercomputer 
array with 500 GB of memory. The predicted structure partition 
model described here also obviates the need for solved protein 
structures, which are mostly available for only one or a few taxa 
per protein, if any at all. Thus, predicted structures can be generated 
for multiple taxa from across the data and then used to calculate a 
consensus predicted structure reflecting the full dataset. Moreover, 
the structural information only needs to be calculated once and then 
can be reused for additional analyses such as gene-wise or taxon- 
wise jackknifing or other controls. 

The combined speed and accuracy of the predicted structure 
model allowed us to run a series of controls for two important ar- 
tifacts potentially affecting deep phylogeny. The possibility of arti- 
factual attraction of exceptionally long ingroup branches to a distant 
outgroup [long-branch attraction (LBA)] is an especially important 
consideration for a rooted tree. On the basis of simple inspection, 
the multiple-excavate root does not appear to reflect a series of LBA 
artifacts for the four excavate groups, as most excavates do not have 
especially long branches relative to other eukaryotes or the outgroup 
(Fig. 1). We tested this further by deleting the euArc proteins with 
the largest ingroup-outgroup distances in their individual trees 
[single-gene trees (SGTs)] using increasingly stringent cutoffs of 
<0.9, <0.8, and <0.7 substitutions per site (controls 1 to 3, table 
S4). All three “long-branch-depleted” datasets yield essentially the 
same topology and support values for major nodes as the full dataset 
—maximal bootstrap support for the four excavate splits (Fig. 1 and 
fig. S2). Deleting the outgroup entirely leads to no change in 
ingroup topology or support, and outgroup-free rooting using the 
nonreversible amino acid model NONREV (19) in IQTREE2 (20) 
produces a multiexcavate root but without statistical support for the 
exact placement of the root within excavates (fig. S4) (19). 

Another possible source of artifact would be the incorrect detec- 
tion of paralogs, because early eukaryote evolution involved 
rampant gene duplication, and many euArcs are multicopy in eu- 
karyotes (21). Thus, the early excavate branches seen here could 
reflect uniquely retained paralogs that were lost in other lineages, 
rather than species phylogeny. Paralogy was rigorously assessed 
throughout the initial orthology assignment phase by iterative phy- 
logenetic screening with stepwise pruning of multigene families. 
These “potential orthologs” were then further screened for any 
signs of paralogy and other potential artifacts to arrive at the 183 
euArc orthologs used to construct the rooted tree (Fig. 1 and Mate- 
rials and Methods). Moreover, deep paralogs, if any remain, should 
appear as strongly supported deep branches in SGTs, but we found 
that strongly supported deep excavate branches are rare in the euArc 
SGTs (table S4 and data S3). Nonetheless, as a further control, we 
removed all proteins for which any excavate taxon, group or sub- 
group, appears as even a moderately supported deep branch in 
their SGT, using increasingly stringent cutoff values of >70, >60, 
and >50% bootstrap support (controls 4 to 6, table S4). This also 
controls for the possibility of a few proteins with a very strong indi- 
vidual phylogenetic signal overwhelming a widespread but weaker 
opposing signal, which can occur even with large multilocus data- 
sets (22). Again, all three controls reconstruct all major eukaryote 
groups with 99 to 100% bootstrap support including 100% support 
for the four excavate splits (Fig. 1 and fig. S3). 

Fig. 1. A rooted phylogeny of eukaryotes based on eukaryotic proteins of archaeal descent. A concatenated alignment of 183 proteins with 45,443 aligned po- 
sitions and 85% overall data occupancy (tables S1 and S2 and data S5) was analyzed by maximum likelihood with eight different evolutionary models. The tree shown was 
derived using the deduced protein structure partitioned model (6 STR + G). Solid circles indicate nodes with 100% bootstrap support from all models and controls, and 
branch lengths are drawn to scale as indicated by the scale bar. Bootstrap support values for the major nodes are shown in the table at the top. Controls 1 to 3 used 
stepwise reduction in ingroup-outgroup distances, while controls 4 to 6 used stepwise reduction in individual tree support for early branching excavates (table S4). Taxon 
names and all bootstrap values for the 6 STR + G tree are shown in fig. S1. 

Fig. 2. Comparative fit of evolutionary models for the six main protein structure elements. Goodness of fit based on Bayesian information criteria [BIC; (40)] was 
calculated using model fit [IQTREE.v1.6.12; (47)] on a 183-protein alignment partitioned into six predicted structural elements using NetSurfP-3.0 (16) and evaluated on 
the best global tree (Fig. 1). Bars show improvement in BIC scores for various models relative to the LG model (without gamma) and are colored according to their general 
model type as indicated by the key to the right. The analyses were run on a 20-core CPU and include optimizing branch lengths and model parameters. The numbers of 
categorical mixture components for each model and CPU time in minutes are shown at the far left of each bar, to the left and right of a slash, respectively. All phylogenetic 
analyses of the 183 euArc data with these models produce the same tree (Fig. 1 and data S2). Raw values are provided in table S5. 
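
The six gene-filtering controls (controls 1 to 3 on ingroup-outgroup distance, controls 4 to 6 on bootstrap support for deep excavate branches in single-gene trees) can be illustrated with a minimal sketch. The function name, the tuple layout of the input, and the example values are hypothetical and are not taken from the source; the sketch only assumes that each gene has already been summarized by one distance and one support value.

```python
def filter_genes(gene_stats, max_distance=None, max_deep_support=None):
    """Return the sorted names of genes retained under the given cutoffs.

    gene_stats maps a gene name to a tuple of
    (ingroup-outgroup distance in substitutions per site,
     highest bootstrap support, in percent, for a deep excavate branch).
    Both fields are hypothetical summaries of a single-gene tree.
    """
    kept = []
    for gene, (distance, support) in gene_stats.items():
        # Long-branch control: keep only genes with distance below the cutoff.
        if max_distance is not None and distance >= max_distance:
            continue
        # Paralogy control: drop genes whose deep excavate branch exceeds the cutoff.
        if max_deep_support is not None and support > max_deep_support:
            continue
        kept.append(gene)
    return sorted(kept)


# Invented example values, for illustration only.
stats = {
    "geneA": (0.65, 40),
    "geneB": (0.85, 55),
    "geneC": (0.95, 75),
}
control_1 = filter_genes(stats, max_distance=0.9)       # cutoff <0.9
control_6 = filter_genes(stats, max_deep_support=50)    # cutoff >50%
```

Each retained gene list would then be concatenated and reanalyzed in the same way as the full dataset.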


DISCUSSION 

Phylogenomic analyses using diverse evolutionary models, includ- 
ing a fast and well-fit protein structure partition model (Fig. 2), 
place the four major lineages of excavate eukaryotes as the first 
four branches of the eukaryote tree of life (Fig. 1 and fig. S1). To- 
gether, these proteins represent the core of eukaryotic and archaeal 
information processing plus diverse other cellular processes (tables 
S2 and S3). Controls for deep phylogeny artifacts, particularly LBA 
and deep paralogy, produce no decrease in support for this topology 
(figs. S1 to S3). Thus, rooted phylogenies based on proteins of both 
archaeal (Fig. 1) and bacterial (11) descent identify excavates as the 
earliest branches of eukaryotes. The fact that the first three of these 
branches are essentially modern cells but lacking aerobic 
mitochondria suggests, among other things, that eukaryotic com- 
plexity arose without the benefit of mitochondrial respiration. 

The euArc phylogeny is notably robust with strong resolution of 
nearly all well-established major divisions of eukaryotes (Fig. 1), 
consistent with a variety of analyses based on different genes, 
taxa, methods, and models (1, 3). The tree is also unusually symmet- 
rical, with the major groups showing roughly similar distances from 
the root. This is especially true if one disregards the two Giardia 
species (Fornicata), which have the long terminal branches typical 
of parasites. Although the tree lacks an outgroup for Archaea, the 
topology is also consistent with the Asgard archaea as sister to eu- 
karyotes (23). The one poorly resolved lineage is the Diaphoretickes 
assemblage referred to as SAR (24), which we found was also unsta- 
ble in euBac phylogeny (11). The branching order within SAR is an 
important question, given that the group includes most of the 
marine algae and major parasites of animals and crop plants. The 
reason for the group's instability in these trees is not clear but may 
be better addressed in analyses without a distant outgroup. 

All the models that we use in this study reconstruct the same 
phylogeny and with nearly identical support values for all major 
nodes (Fig. 1). Nonetheless, the partitioned structure model has a 
number of advantages. First, it provides model fitness close to that 
of more complex models without exploding the parameter space or 
the mixture of components. This makes the model more efficient 
and avoids the issue of circularity. Second, the current dataset 
is very robust, so that any reasonably accurate model reproduces 
the same tree (Fig. 1). However, this would not necessarily be the 
case with smaller datasets (fewer genes or taxa) or more challenging 
phylogenetic problems, e.g., very distantly or very closely related se- 
quences or taxa, where site-wise sequence information may be in- 
sufficient to accurately model protein structure. Thus, it remains to 
be determined whether the structure mixture models and structure 
partition model perform similarly with data of varying quality. There 
is also the question of the carbon footprint of highly demanding 
computational analyses, which are hard to justify if simpler 
models of similar accuracy are available. Perhaps simpler models 
can also be more readily combined with other models such as cor- 
rections for variation among lineages, either in terms of substitution 
rates (25) or other factors. These are likely major impediments to 
accurate phylogenetic reconstruction at many levels. Further refine- 
ment of the structure partition method might be gained by includ- 
ing weaker physical factors such as aromaticity, charge, and 
rotational isomer state (26-28). However, no fast and easy 
method that integrates all these factors is now available, and the pos- 
sible added value would need to be evaluated. 

It is important to distinguish the secondary structure partition- 
ing model that we present here from partitioning by locus (gene or 
protein). The latter is essentially partitioning by tertiary structure, 
which is a fairly rough level of partitioning and does not appear to 
contribute strongly to phylogenetic accuracy or may even hinder it 
(29). Nor is our protocol similar to the automated partitioning 
method of PartitionFinder (30), which works by stepwise merging 
of loci where such mergers lead to an improved Bayesian information cri- 
terion (BIC) score. Partitioning by locus or even combined loci fun- 
damentally differs from the fine-level secondary structure 
partitioning used here. Moreover, protein secondary structure and 
solvent exposure have been explicitly shown to strongly constrain 
amino acid substitution patterns throughout the protein, unlike ter- 
tiary structure (12, 13). 

Fig. 3. A proposed stepwise scenario for the origin of mitochondria and mi- 
tochondria-like organelles. A schematic version of the rooted euArc phylogeny 
in Fig. 1 is shown with two proposed endosymbiotic events, the earlier most likely 
involving a γ- and/or δ-proteobacterium (γ-/δ-proteo) and the second an α-proteo- 
bacterium (α-proteo). 

Many potentially interesting taxa are still missing from the 
current euArc tree. This is primarily because they lacked sufficient 
data for adequate prescreening at the time of these analyses. Screen- 
ing is particularly problematic for intriguing but newly described 
species where genomic data tend to be limited and/or of mixed 
quality. This makes it difficult to identify orthologs with confidence, 
which is critical here. We have also omitted phylogenetically chal- 
lenging taxa such as Cryptophytes and Haptophytes, which are 
probable members of Diaphoretickes, possibly closely related to 
Archaeplastida. While these taxa are interesting in their own 
right, they are not directly relevant to the question addressed here 
(11). We also do not include the phylogenetically problematic Ma- 
lawimonads, for which there is a lack of sufficient quality data. In 
future, it would be especially interesting to include these taxa, as 
they are often, if sporadically, assigned to Excavata. 

An anaerobic excavate root raises interesting questions regarding 
the nature of the last eukaryote common ancestor (LECA) and the 
origin of mitochondria. If LECA had a respiratory-competent mi- 
tochondrion, as is widely held, then an early ancestor of each of the 
three anaerobic excavate lineages would have had to migrate inde- 
pendently to a low-oxygen environment. Meanwhile, each lineage 
would also have had to retain at least one fully mitochondriate 
branch that remained extant long enough to give rise to the next 
surviving split in the tree. However, there is now no evidence of 
any aerobic branch in any of the three anaerobic excavate groups. 
Each anaerobic excavate lineage would also have had to indepen- 
dently reduce their mitochondrion to nonrespiring hydrogeno- 
somes or mitosomes [mitochondria-related organelles (MROs); 
(31)]. The latter at least is not altogether unlikely, as multiple exam- 
ples of such reductions have been documented in other eukaryotes 
(5, 32). However, a theoretically simpler explanation would be that 
the LECA simply had an MRO, most likely a hydrogenosome (33), 
and that mitochondrial respiration arose later, sometime after the 
divergence of Preaxostyla and before the emergence of Discoba 
(Fig. 3). This would suggest that aerobic mitochondria arose by a 
separate endosymbiosis from that which gave rise to 
hydrogenosomes. 

Such a “serial endosymbiotic” scenario is consistent with the 
mixed ancestry of euBacs in general and especially mitochondrial 
proteins, only a small fraction of which trace to α-proteobacteria 
(34). A late advent of respiration is also consistent with the often 
much higher sequence conservation of eukaryotic genes of α-pro- 
teobacterial descent compared to genes tracing to other bacteria (11, 
32). A relatively late origin of respiration would also help explain the 
unique presence of notably α-proteobacterial-like mitochondrial 
genomes in the Jakobida [Discoba; (35)]. Given that the most 
common donors of euBacs are α-proteobacteria followed by δ-pro- 
teobacteria (32), while the second most common donor of mito- 
chondrial proteins is γ-proteobacteria (11, 36), the simplest 
scenario would be a γ- or δ-proteobacterial endosymbiosis followed 
by an α-proteobacterial one (Fig. 3). Alternatively, there may have 
been multiple endosymbioses of varying success both before and 
after LECA, as previously suggested (32). It should be noted that 
these scenarios share similarities with the archaezoa hypothesis of 
Cavalier-Smith (37), although that was later abandoned after the 
discovery of MROs and tree reconstruction artifacts (38). 
Excluding taxa from the euArc tree means that we cannot rule 
out the possibility that one or more of these taxa, or other yet-un- 
discovered taxa, may represent earlier branches, given that much 
eukaryote diversity remains unknown (3, 7). However, no addition 
of taxa will change the fundamental relationship described here, i.e., 
that the earliest branches of extant eukaryotes include multiple an- 
aerobic lineages with predominantly excavate morphology. The im- 
plications of this are profound. For example, eukaryotic cellular and 
molecular complexity probably predate mitochondrial respiration. 
Modern eukaryotes could also have arisen before the great oxygen 
event (4), which is consistent with recent molecular dating (5, 39). 
Eukaryotes probably also had an excavate morphology for much of 
their early history, and this morphology may have formed the basis 
for other morphological innovations. However, it is important to 
note that an excavate morphology is so far unknown for Parabasalia, 
which were primarily assigned to Excavata based on unrooted trees. 
Thus, this enigmatic taxon may be a key to understanding eukaryote 
origins and the nature of LECA and the forces that shaped it. 


MATERIALS AND METHODS 

Initial euArc protein selection and dataset assembly 

A dataset of universal eukaryotic proteins of archaeal descent 
(euArcs) was assembled beginning with 719 proteins identified as 
shared by Archaea and Eukarya based on a cross section of the 
arCOG and KOG databases (15). Individual datasets were assem- 
bled for each of the 719 proteins by BLASTp search of the 
GenBank nr data for a taxonomically broad set of Archaea, 
Eukarya, and Bacteria. Datasets were further augmented by 
BLASTp search of in-house data for four additional excavate taxa 
and in-house assemblies of an additional 27 excavates from publicly 
available SRA files (table S1) using Trinity version 2.13.2 (40). The 
four in-house excavate transcriptomes were sequenced from mono- 
eukaryote cultures as previously described (41, 42). Taxonomic 
identifiers were prepended to sequence names as shown in table 
S1. Sequence searches used BLASTp with a relaxed e value cutoff 
of e-05 to extract all possible paralogs. The resulting sequence sets 
were then ranked on the basis of their taxonomic coverage across 
Archaea and Eukarya, and 456 protein sets were identified as 
having wide taxon coverage in both domains. 


Ortholog detection 

Each of the 456 candidate proteins was aligned individually using 
Mafft with FFT-NS-i algorithm and --maxiterate 1000 (43) and 
then trimmed using TrimAl automated1 to remove regions of ambiguous 
alignment (44). Each protein set was then screened for multigene 
families (paralogs) with maximum likelihood and the Shimo- 
daira-Hasegawa (SH) test as implemented in FastTree 2.1 (45). Pro- 
teins showing non- or weakly monophyletic eukaryotes were 
deleted, while proteins showing multiple, well-separated, strongly 
supported eukaryote-wide clades were broken down into single can- 
didate orthologs. Candidate orthologs were defined as highly dis- 
tinct clades (long subtending branch) that were also out-paralog- 
free and showed strong and consistent support for eukaryote mono- 
phyly (SH > 0.95). This resulted in 441 candidate orthologs. 
Bitscores were then recalculated for each sequence for each candi- 
date ortholog by BLASTp search using two eukaryote query se- 
quences from within each ortholog candidate against the rest of 
its sequences with the highest score retained. Each candidate ortho- 
log was further screened for in-paralogs by iterative phylogenetic 
analysis using IQTREE (46), with command arguments as indicated 
in data S4. In-paralogs, i.e., multiple copies in the same or closely 
related taxa (same genus), were reduced to a single sequence accord- 
ing to the following criteria: (i) lowest recalculated bitscore, (ii) 
fewest gaps after trimming, and/or (iii) shortest terminal branch 
in their IQTREE inference. 


Ortholog assessment 

Candidate orthologs were further subjected to multiple rounds of 
screening and filtering by phylogenetic inference using IQTREE 
with LG + G4 and ultrafast bootstraps. This resulted in the elimina- 
tion of proteins that failed one or more of the following criteria: (i) 
failure to support eukaryote monophyly with >90% bootstrap 
support, (ii) Eukarya closer to Bacteria than Archaea, (iii) strong 
disagreement (>70% bootstrap) with accepted eukaryote phyloge- 
ny, i.e., one or more taxa falling within a noncanonical major clade 
[listed as kingdoms in table S1 and (3)], (iv) lack of taxa for more 
than two major subgroups of Archaea, (v) lack of taxa for any of the 
eukaryote superkingdoms [table S1 and (3)], or (vi) one or more 
major groups showing overall substantially lower bitscores than 
the remaining eukaryotes groups. 

The final rounds of assessment were conducted using IQTREE 
as above, after removing the bacterial sequences. Any remaining eu- 
karyotic within-kingdom paralogs (in-paralogs) in terminal or 
near-terminal clades in the resulting trees were reduced to the 
single shortest-branched sequence. In the case of any remaining 
nonterminal paralogs, the protein in question was suspected of 
hidden paralogy and discarded. Sequences with apparent 
xenologs, contaminant data, or sequences producing excessively 
long branches were also removed. This was followed by a repeat 
round of screening for the five criteria listed above. The final 
result is a set of 183 individual ortholog datasets. 

A final set of alignments was produced for each of the 183 ortho- 
logs using Mafft with L-INS-i algorithm and --maxiterate 1000 (43) and 
trimmed using TrimAl automated1 to remove regions of ambiguous 
alignment (44). The alignments were then further trimmed to 
remove alignment columns with more than 80% missing data in 
either Eukarya or Archaea and subjected to maximum likelihood 
and rapid bootstraps analysis with RAxML (47) under the 
LGPROTCAT model. The resulting final set of individual trees 
(SGTs; data S3) was used as the basis for six control analyses 
(table S4). Command arguments for tree inferences are given in 
data S4. 
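
The additional column filter described above (removing columns with more than 80% missing data in either Eukarya or Archaea) can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the dictionary-based alignment, and the choice of which characters count as missing are all assumptions.

```python
# Characters treated as missing data (an assumption; the source does not list them).
MISSING = set("-?X")


def missing_fraction(alignment, taxa, column):
    """Fraction of the given taxa with missing data at one alignment column."""
    return sum(alignment[t][column] in MISSING for t in taxa) / len(taxa)


def trim_columns(alignment, eukarya, archaea, max_missing=0.8):
    """Drop columns with more than max_missing missing data in either domain.

    alignment maps sequence name -> aligned sequence (all of equal length);
    eukarya and archaea are lists of sequence names for the two domains.
    """
    n_columns = len(next(iter(alignment.values())))
    keep = [
        c for c in range(n_columns)
        if missing_fraction(alignment, eukarya, c) <= max_missing
        and missing_fraction(alignment, archaea, c) <= max_missing
    ]
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


# Toy alignment with invented sequences, for illustration only.
toy = {
    "e1": "A-KM", "e2": "A--M", "e3": "A--M", "e4": "A--M", "e5": "A--M",
    "a1": "AGK-", "a2": "AGK-",
}
trimmed = trim_columns(toy, ["e1", "e2", "e3", "e4", "e5"], ["a1", "a2"])
```

In the toy alignment, the second column is removed because it is entirely missing in Eukarya and the fourth because it is entirely missing in Archaea, while the third column (exactly 80% missing in Eukarya) is retained.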


Phylogenetic inference, models, and model fitness 

Maximum likelihood trees were constructed for the full concatenat- 
ed 183 protein alignment (supermatrix) with and without data par- 
titioning. Unpartitioned supermatrix analyses were conducted 
using five method-model combinations as follows: (i) LG model 
with four gamma rate categories using RAxML with rapid boot- 
straps (47) and (ii) IQTREE (46) with LG4X model (48) and ultra- 
fast bootstraps (49); (iii) C20 and (iv) C60 profile mixture models 
(17) using IQTREE with the LG + C20 + G4 + F and LG + C60 + G4 
models, respectively, and ultrafast bootstraps; and (v) protein 
structure mixture models with EX_EHO + G4, EX2 + G4, and 
EHO + G4 (12) using IQTREE and ultrafast bootstraps. Trees are 
summarized in Fig. 1 and provided in newick format in data S2. 
All IQTREE ultrafast bootstraps used the optimization flag -bnni 
for a more thorough search of tree space. 


Protein structure partition model 

Protein structure partitioning consists of five steps: (i) structure pre- 
diction, (ii) structure consensus calculation, (iii) consensus 
mapping onto trimmed alignments, (iv) alignment partitioning 
by consensus structure, and (v) phylogenetic analysis. Structures 
are predicted for each protein with NetSurfP-3.0 
(16) using full sequences for 10 species from across the tree. The 10 
predicted structures are then used to calculate a plurality consensus 
predicted structure site-wise for each protein, which is then mapped 
onto the full trimmed alignment for that protein. Last, each protein 
alignment is partitioned into the six predicted structure categories 
—buried and exposed helices, sheets, and loops—and subjected to 
phylogenetic analysis using corresponding structure-based ex- 
changeability matrices (12). 
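
Steps (ii) and (iv) above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that per-taxon predictions have already been mapped onto the trimmed alignment, that secondary structure and exposure are reduced to three and two states, respectively, and that the two consensuses are computed separately. All names and labels are hypothetical.

```python
from collections import Counter

GAP = None  # marks alignment positions with no prediction for a taxon


def plurality(states):
    """Most frequent non-gap state in a column; ties go to the state seen first."""
    observed = [s for s in states if s is not GAP]
    if not observed:
        return GAP
    return Counter(observed).most_common(1)[0][0]


def partition_sites(structure, exposure):
    """Assign alignment columns to exposure-by-structure categories.

    structure[t][i] is the predicted secondary structure of taxon t at
    column i ("H" helix, "E" sheet, "C" loop); exposure[t][i] is
    "buried" or "exposed". Returns a dict mapping (exposure, structure)
    to the list of 0-based column indices in that category.
    """
    n_columns = len(structure[0])
    partitions = {}
    for i in range(n_columns):
        ss = plurality([taxon[i] for taxon in structure])
        ex = plurality([taxon[i] for taxon in exposure])
        if ss is GAP or ex is GAP:
            continue  # no prediction available for this column
        partitions.setdefault((ex, ss), []).append(i)
    return partitions


# Invented predictions for three taxa and four columns, for illustration only.
structure = [list("HHEC"), list("HCEC"), ["H", "H", "E", GAP]]
exposure = [
    ["buried", "exposed", "buried", "exposed"],
    ["buried", "exposed", "buried", "exposed"],
    ["exposed", "exposed", "buried", GAP],
]
parts = partition_sites(structure, exposure)
```

The resulting column lists would then be written out as a partition file for the phylogenetic analysis in step (v), with one substitution matrix per category.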

For the 183 euArc proteins, secondary structures were predicted 
for four Archaea and six eukaryotes, with taxa selected on the basis 
of least amount of gap positions in the trimmed alignments (data 
S1). In terms of predicted structure, the resulting consensus align- 
ments show 95% match of the individual predicted structures to the 
consensus, and 80% of the alignment columns have strict consen- 
sus. For solvent exposure, 93% of residues match the consensus 
and 74% of columns have strict consensus (data S1). Phylogenetic 
analyses of the partitioned data were run using six structure-based 
exchangeability matrices and frequencies (12) in IQTREE 1.6.12 
with the -spp option, which assigns separate gamma rate categories 
to each partition. Ultrafast bootstrap analysis (-bb 1000) was opti- 
mized as above. The model was used for analyses of the full dataset 
and six control analyses (table S4). The resulting trees are provided 
in newick format in data S2. 
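
The two agreement measures reported above (the share of residues matching the consensus and the share of columns with strict consensus) can be computed as in the following sketch. The exact definitions used by the authors are not given here, so the definitions in the docstring are assumptions, as are the function name and the example data.

```python
from collections import Counter


def consensus_agreement(columns):
    """Summarize agreement of per-taxon predictions with the column consensus.

    columns is a list of alignment columns, each a list of predicted
    states with None for gaps. Returns a tuple of
    (fraction of non-gap residues equal to their column's plurality state,
     fraction of non-empty columns in which all residues agree).
    """
    matched = total = strict = scored = 0
    for states in columns:
        observed = [s for s in states if s is not None]
        if not observed:
            continue  # skip columns without any prediction
        top_count = Counter(observed).most_common(1)[0][1]
        matched += top_count
        total += len(observed)
        scored += 1
        if top_count == len(observed):
            strict += 1
    if scored == 0:
        raise ValueError("no columns with predictions")
    return matched / total, strict / scored


# Invented columns, for illustration only.
example = [["H", "H", "H"], ["H", "E", "H"], ["C", None, "C"], [None, None, None]]
residue_match, strict_columns = consensus_agreement(example)
```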


Comparison of model fitness, complexity, and run times 
Model fitness comparison tests were performed for each of the six 
protein structure categories using IQTREE model finder (50) and 
the BIC scores. Run times are the CPU time in minutes required 
to optimize the likelihood and branch lengths of the best global 
tree (Fig. 1) under each model as reported in the IQTREE log file 
for each run. Raw fitness scores for BIC, Akaike information crite- 
rion (AIC), AIC corrected for small sample size (AICc), and the log- 
likelihood values and CPU time are provided in table S5. 
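
For reference, the three criteria can be computed from a log-likelihood with their standard textbook formulas, as in the sketch below. The function names are hypothetical, the example numbers are invented, and taking the sample size to be the number of alignment sites is an assumption about how the scores were obtained.

```python
import math


def information_criteria(log_likelihood, n_params, n_sites):
    """Return AIC, AICc, and BIC for one fitted model (lower is better)."""
    if n_sites <= n_params + 1:
        raise ValueError("AICc requires n_sites > n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    bic = n_params * math.log(n_sites) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def bic_improvement(baseline_bic, model_bic):
    """Improvement over a baseline model; positive values mean a better fit."""
    return baseline_bic - model_bic


# Invented values, for illustration only.
baseline = information_criteria(log_likelihood=-1050.0, n_params=10, n_sites=500)
candidate = information_criteria(log_likelihood=-1000.0, n_params=10, n_sites=500)
gain = bic_improvement(baseline["BIC"], candidate["BIC"])
```

With equal parameter counts, as in this toy case, the BIC improvement reduces to twice the difference in log-likelihood.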


Correction (22 May 2023): Due to a production error, supplemental data files S1-S5 were not 
included with this paper at the time of publication and supplemental data file S6 was 
incomplete. These files have been uploaded and are now available with the supplementary 
materials. 


Supplementary Materials 
This PDF file includes: 

Figs. S1 to S4 

Legends for tables S1 to S5 

Legends for data S1 to S6 


Other Supplementary Material for this 
manuscript includes the following: 
Tables S1 to S5 
Data S1 to S6 